17 research outputs found

    Automated Code Generation for Lattice Quantum Chromodynamics and beyond

    We present here our ongoing work on a Domain Specific Language that aims to simplify Monte-Carlo simulations and measurements in the domain of Lattice Quantum Chromodynamics. The tool-chain, called Qiral, is used to produce high-performance OpenMP C code from LaTeX sources. We discuss conceptual issues and details of implementation and optimization, and compare the performance of the generated code with that of well-established simulation software.
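    As a rough illustration only (not actual Qiral input or output, and with illustrative names such as VOLUME and site_vec), the kind of OpenMP C kernel such a tool-chain targets is a lattice-wide linear-algebra expression whose site loop is parallelized:

    /* Hypothetical sketch of generated OpenMP C: y = a*x + y over all lattice
     * sites, with 12 complex components (3 colours x 4 spins) per site.
     * Names are illustrative, not Qiral's. */
    #include <complex.h>
    #include <omp.h>

    #define VOLUME (16*16*16*32)          /* example lattice size */
    typedef double complex site_vec[12];

    void axpy(double complex a, const site_vec *x, site_vec *y)
    {
        #pragma omp parallel for
        for (long s = 0; s < VOLUME; ++s)
            for (int c = 0; c < 12; ++c)
                y[s][c] += a * x[s][c];
    }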

    Automated Code Generation for Lattice QCD Simulation

    Quantum Chromodynamics (QCD) is the theory of the strong nuclear force, responsible for the interactions between sub-nuclear particles. QCD simulations are typically performed through the lattice gauge theory approach, which provides a discrete analytical formalism called LQCD (Lattice Quantum Chromodynamics). LQCD simulations usually involve generating and then processing data at the petabyte scale, which demands multiple teraflop-years on supercomputers. Large parts of both generation and analysis can be reduced to the inversion of an extremely large matrix, the so-called Wilson-Dirac operator. Because this matrix is always sparse and structured, iterative methods are the natural choice for this inversion. The application of the operator, which amounts to a matrix-vector product, therefore appears as a critical computation kernel that should be optimized as much as possible. Evaluating the Wilson-Dirac operator involves symmetric stencil calculations in which each node has 8 neighbors. Such a configuration is a serious obstacle when it comes to memory accesses and data exchanges among processors. For current and future generations of supercomputers, the hierarchical memory structure makes it next to impossible for a physicist to write efficient code by hand. Addressing these issues in order to devote an acceptable share of the computing cycles to the real need, that is, to reach a good level of efficiency, is the main concern of this paper. We present here a Domain Specific Language and corresponding toolkit, called QIRAL, which provides a complete solution from symbolic notation to simulation code.
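    To make the structure of that critical kernel concrete, the following is a minimal sketch (our illustration, not QIRAL output) of an 8-neighbour stencil sweep on a 4D lattice: each site combines contributions from its forward and backward neighbours in the four directions, which is exactly the access pattern that dominates memory traffic. The full Wilson-Dirac operator additionally carries 3x4 colour-spin components per site, SU(3) link matrices and gamma-matrix projectors on each hop, all omitted here.

    /* Illustrative 8-neighbour stencil on a 4D lattice (one scalar per site). */
    #include <omp.h>

    #define LT 32
    #define LS 16

    static inline long idx(int t, int x, int y, int z)
    {
        return ((long)t*LS*LS*LS) + ((long)x*LS*LS) + ((long)y*LS) + z;
    }

    void hopping(const double *in, double *out, double kappa)
    {
        #pragma omp parallel for collapse(2)
        for (int t = 0; t < LT; ++t)
            for (int x = 0; x < LS; ++x)
                for (int y = 0; y < LS; ++y)
                    for (int z = 0; z < LS; ++z) {
                        double h =
                            in[idx((t+1)%LT, x, y, z)] + in[idx((t+LT-1)%LT, x, y, z)]
                          + in[idx(t, (x+1)%LS, y, z)] + in[idx(t, (x+LS-1)%LS, y, z)]
                          + in[idx(t, x, (y+1)%LS, z)] + in[idx(t, x, (y+LS-1)%LS, z)]
                          + in[idx(t, x, y, (z+1)%LS)] + in[idx(t, x, y, (z+LS-1)%LS)];
                        out[idx(t, x, y, z)] = in[idx(t, x, y, z)] - kappa * h;
                    }
    }

    An iterative inversion then only needs repeated applications of this kernel together with vector updates and dot products, which is why the memory behaviour of the stencil dominates the overall simulation cost.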

    CASH: Revisiting hardware sharing in single-chip parallel processor

    As increasing the issue width yields diminishing returns on superscalar processors, thread parallelism on a single chip is becoming a reality. In the past few years, both the SMT (Simultaneous MultiThreading) and CMP (Chip MultiProcessor) approaches were first investigated by academics and are now implemented by industry. In some sense, CMP and SMT represent two extreme design points.

    CASH Design Space Exploration

    As increasing the issue width yields diminishing returns on superscalar processors, thread parallelism on a single chip is becoming a reality. In the past few years, both the SMT and CMP approaches were first investigated by academics and are now implemented by industry. In some sense, SMT and CMP represent two extreme design points. The CASH parallel processor (for CMP And SMT Hybrid) is a possible intermediate design point for on-chip thread parallelism in terms of design complexity and hardware sharing. It retains resource sharing, as in SMT, when such sharing can be made non-critical for the implementation, but splits resources, as in CMP, wherever sharing leads to a superlinear increase in hardware complexity. This paper explores the multi-dimensional design space of the CASH architecture. It compares the performance of a single thread running on CASH, SMT and CMP processors, then investigates the performance of multi-program and parallel workloads on these processors, and finally explores how CASH performance varies with cache size and cache associativity. The experimental results show that the CASH processor has great potential to improve the performance of single-thread workloads and most multi-program workloads, while maintaining a lower implementation complexity than SMT and CMP.

    An Hybrid Data Transfer Optimization Technique for GPGPU

    Graphics Processing Units (GPU) can provide tremendous computing power; current NVidia and ATI hardware displays a peak performance of hundreds of gigaflops. However, because the data transfer speed between CPU and GPU is limited, these devices are difficult to use for accelerating numerical applications. In this paper we propose a hybrid software technique for automatically optimizing data transfers, based on static and dynamic information about data accesses.
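    As a rough illustration of the general idea (our sketch under assumed names such as mirror_t, not the paper's actual technique or runtime), a host-side runtime can keep a per-buffer validity flag and skip host-to-device copies whenever static analysis or run-time tracking shows the device copy is still up to date:

    /* Illustrative sketch: skip redundant host-to-device copies by tracking
     * whether the device mirror of a buffer is still valid.  Uses the CUDA
     * runtime C API; the flag updates stand in for the static/dynamic
     * information on data accesses described in the abstract. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    typedef struct {
        void  *host;
        void  *dev;
        size_t bytes;
        int    dev_valid;   /* 1 if the device copy matches the host copy */
    } mirror_t;

    /* Called before a kernel launch that reads the buffer on the GPU. */
    void to_device_if_needed(mirror_t *m)
    {
        if (!m->dev) cudaMalloc(&m->dev, m->bytes);
        if (!m->dev_valid) {
            cudaMemcpy(m->dev, m->host, m->bytes, cudaMemcpyHostToDevice);
            m->dev_valid = 1;
        }
    }

    /* Called whenever the CPU writes the buffer (known from static analysis
     * or detected at run time), invalidating the device mirror. */
    void host_wrote(mirror_t *m) { m->dev_valid = 0; }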